{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "fda7b261403818bf444379d71956dbde60c04c41"
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "c0b1a7ed1c896a7ccd7dd41eb08795fcc9205bfc"
   },
   "source": [
    "**Research plan**\n",
    " 1. Feature and data explanation \n",
    " 2. Primary data analysis\n",
    " 3. Primary visual data analysis\n",
    " 4. Insights and found dependencies\n",
    " 5. Metrics selection\n",
    " 6. Model selection\n",
    " 7. Data preprocessing\n",
    " 8. Cross-validation and adjustment of model hyperparameters\n",
    " 9. Creation of new features and description of this process\n",
    " 10. Plotting training and validation curves\n",
    " 11. Prediction for test or hold-out samples \n",
    " 12. Conclusions "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "58f1ec3bf056e7e7706bd564035676741b8ade35"
   },
   "source": [
    "### Part 1. Feature and data explanation \n",
    "\n",
    "There are provided hourly rental data spanning two years. For this [competition](https://www.kaggle.com/c/comp180bikeshare), the [training set](https://www.kaggle.com/c/comp180bikeshare/data) is comprised of the first 16 days of each month, while the test set is the 17-19th day of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period. That is, predict \"count\" without using \"count\" or it's components \"casual\" and \"registered\".\n",
    "\n",
    "**Data Fields**\n",
    "\n",
    "* *datetime* - hourly date + timestamp\n",
    "* *season* - 1 = spring, 2 = summer, 3 = fall, 4 = winter\n",
    "* *holiday* - whether the day is considered a holiday\n",
    "* *workingday* - whether the day is neither a weekend nor holiday\n",
    "* *weather* - \n",
    "    1.  Clear, Few clouds, Partly cloudy, Partly cloudy\n",
    "    2.  Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist\n",
    "    3.  Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds\n",
    "    4.  Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog\n",
    "* *temp* - temperature in Celsius\n",
    "* *atemp* - \"feels like\" temperature in Celsius\n",
    "* *humidity* - relative humidity\n",
    "* *windspeed* - wind speed\n",
    "* *casual* - number of non-registered user rentals initiated\n",
    "* *registered* - number of registered user rentals initiated\n",
    "* *count* - number of total rentals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "940b76d03f710ba594cf843ff2f8b33e55a96657"
   },
   "source": [
    "### Part 2. Primary data analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import os\n",
    "import gc\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from catboost import CatBoostRegressor, Pool, cv\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "import seaborn as sns\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Fixing random seed\n",
    "np.random.seed(17)\n",
    "\n",
    "print(os.listdir(\"../input\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
   },
   "outputs": [],
   "source": [
    "# Fixing random seed\n",
    "np.random.seed(17)\n",
    "# Read data\n",
    "data_df = pd.read_csv('../input/train_luc.csv')\n",
    "\n",
    "# Convert to datetime\n",
    "data_df['datetime'] = pd.to_datetime(data_df['datetime'])\n",
    "\n",
    "# Sort by datetime\n",
    "data_df.sort_values(by='datetime')\n",
    "\n",
    "# Look at the first rows of the training set\n",
    "data_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "c91695f66ef8d56667e31fca901bb4fb18148ab0"
   },
   "outputs": [],
   "source": [
    "data_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "02a68535835ed033f0671bfd5dd27ceb2fff2852"
   },
   "source": [
    "*Train contains 3 target columns:  ** 'casual', 'registred', 'count'.**\n",
    "Column 'count' is the sum of columns 'casual' and 'registred'. Check it:*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "37a98379c9af9a0fd1c6f4c9f377fa7d43db8d38"
   },
   "outputs": [],
   "source": [
    "(data_df['casual'] + data_df['registered'] - data_df['count']).value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "90117f0ab2747a3d0eb049b73204f21851a1d959"
   },
   "outputs": [],
   "source": [
    "# Get info by train\n",
    "data_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ec94e47c80912e90d8a724a525b2156e0d91b17f"
   },
   "source": [
    "Excelent! We have not NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b7dc7f6e52df63e0e7cf9ce244e2ea92f721f9ee"
   },
   "outputs": [],
   "source": [
    "# Get statistics by train_df\n",
    "data_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7dc55202c7b5aed3052edd7aad5c1c87eb54e50f"
   },
   "source": [
    "### Part 3. Primary visual data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "b920c43275c8351ead114f724bedd2bb87ba3100"
   },
   "source": [
    "I split data into train and hold-out samples "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "965d15f4b30d088ecf1d07eaeeeac78e9acb3a14"
   },
   "outputs": [],
   "source": [
    "train_df, test_df, y_train, y_test = train_test_split(data_df.drop(['casual', 'registered', 'count'], axis=1), data_df[['casual', 'registered', 'count']], \n",
    "                                                      test_size=0.3, random_state=17, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "07a18dda77b77b74044e0e9cf7c8f914a6de07cb"
   },
   "outputs": [],
   "source": [
    "def draw_train_test_distribution(column):\n",
    "    _, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,3))\n",
    "    sns.distplot(train_df[column], ax = axes[0], label='train')\n",
    "    sns.distplot(test_df[column], ax = axes[1], label='test');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "691a4a7e0df12197e411200857a599ac1d568c8c"
   },
   "outputs": [],
   "source": [
    "# The distribution of the indicative features\n",
    "draw_train_test_distribution('season')\n",
    "draw_train_test_distribution('holiday')\n",
    "draw_train_test_distribution('workingday')\n",
    "draw_train_test_distribution('weather');\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "3fd4d473e0876872938b27267118b56ef5ec4ec3"
   },
   "source": [
    "The distribution of the indicative features ('season', 'holiday', 'workingday', 'weather') on the train and test coincides. There is 'weather' with value 4 on the test and absent on the train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "ec9a10e9eac94042047a1512a6b6d8bb4ca5e276"
   },
   "outputs": [],
   "source": [
    "test_df[test_df['weather'] == 4]['weather'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a2dce21ca0dd976b1495b14f7871ae5a10a774cf"
   },
   "source": [
    "Only 1 exampe contains 'weather'=4.  This can be neglected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "44b4df71dad89a784874950f1983a7d1a262214e"
   },
   "outputs": [],
   "source": [
    "# The distribution of the numerical features on the train and test\n",
    "draw_train_test_distribution('temp')\n",
    "draw_train_test_distribution('atemp')\n",
    "draw_train_test_distribution('humidity')\n",
    "draw_train_test_distribution('windspeed')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "21255c9c789b00935df2fe9649e91c034cfd6b2f"
   },
   "source": [
    "The distribution of the scale features ('temp', 'atemp', 'humidity', 'windspeed') on the train and test coincides"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "995d82d1ba35b7aab5b82f50039182a7ab098944"
   },
   "outputs": [],
   "source": [
    "def transformation(columnName, func = np.log1p):\n",
    "    temp_train = pd.DataFrame(index=train_df.index)\n",
    "    temp_train[columnName] = train_df[columnName].apply(func)\n",
    "\n",
    "    temp_test = pd.DataFrame(index=test_df.index)\n",
    "    temp_test[columnName] = test_df[columnName].apply(func)\n",
    "    \n",
    "    _, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,3))\n",
    "    sns.distplot(temp_train, ax = axes[0])\n",
    "    sns.distplot(temp_test, ax = axes[1]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3c72f7ad73a4e00c3a220dbb54953136788cd4dd"
   },
   "outputs": [],
   "source": [
    "transformation('temp')\n",
    "transformation('atemp')\n",
    "transformation('humidity')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "4d850160d357dca906d6ab40b21c4f0c9f66ceb1"
   },
   "source": [
    "After the transformation of the distribution on the train_df and on the test_df began to resemble each other more"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "37debd3896c5999b1ec275d499db2834ad795630"
   },
   "outputs": [],
   "source": [
    "train_df['temp_tr'] = train_df['temp'].apply(np.log1p)\n",
    "test_df['temp_tr'] = test_df['temp'].apply(np.log1p)\n",
    "train_df['atemp_tr'] = train_df['atemp'].apply(np.log1p)\n",
    "test_df['atemp_tr'] = test_df['atemp'].apply(np.log1p)\n",
    "train_df['humidity_tr'] = train_df['humidity'].apply(np.log1p)\n",
    "test_df['humidity_tr'] = test_df['humidity'].apply(np.log1p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "910dbf9ceb6db50bba28df0705ed17e8b588faa3"
   },
   "source": [
    "### Part 4. Insights and found dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8c518a0c1e027d21c2a7e4658f71849b45451520"
   },
   "outputs": [],
   "source": [
    "corr = train_df.join(y_train).corr('spearman')\n",
    "plt.figure(figsize = ( 12 , 10 ))\n",
    "sns.heatmap(corr,annot=True,fmt='.2f',cmap=\"YlGnBu\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "994f9cd1c81739614fc2fb352bba42c5b96418e8"
   },
   "source": [
    "    We can see a strong correlation between target columns (What was expected). Among customers, correlation with registered users is higher than with unregistered ones. In this case, unregistered looking more at the weather.\n",
    "    Correlations temp/temp_tr, atemp/atemp_tr, humidity/humidity_tr are equals. I'll use *_tr features, because their distibutions into train/test are similar.\n",
    "    **Idea: Build an ensemble using 3 models with different target!**\n",
    "    'Holiday' has a low correlation with targets. I'll not use it.\n",
    "    'Workingday' has a lower correlation with 'registered' and 'count' than with 'casual'\n",
    "    'Windspeed' affects both registered and unregistered users\n",
    "    The effect of 'temp' and 'atemp' is the same.\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "bd5383604af6b6b4ef0fea43d0aeb21fc987a876"
   },
   "source": [
    "### Part 5. Metrics selection\n",
    "According to the terms of [competition](https://www.kaggle.com/c/comp180bikeshare#evaluation) i will use Root Mean Squared Error (RMSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7f738b56e31de4b952a99484080c60940a869773"
   },
   "source": [
    "### Part 6. Model selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "e21693835a5b83a8076db71ea0c6d0859aca9ec1"
   },
   "source": [
    "The [course](https://www.coursera.org/learn/competitive-data-science) recommend using [Gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) model as one of the most powerful. I want use Catboost library, because i want to understand it. From catboost library i 'll use CatBoostRegression, because current task is related to regression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "74f35619d33b232901dff44e15af457eee048f48"
   },
   "source": [
    "### Part 7. Data preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "64864195c530fbd8f64a463342ce99b1ee84eae0"
   },
   "source": [
    "* I changed 'temp',  'atemp', 'humidity' features in part 3. \n",
    "* NaN is absent. \n",
    "* I will not use OHE because the CatBoostRegressor takes a list of categorical features as a parameter\n",
    "* Scaling data in CatBoost will be auto"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8c8c8f409aeba7833aaa33d4c0c36ca7cc26781c"
   },
   "outputs": [],
   "source": [
    "X_train = train_df.drop(['holiday', 'datetime', 'temp', 'atemp', 'humidity'], axis=1)\n",
    "X_test = test_df.drop(['holiday', 'datetime', 'temp', 'atemp', 'humidity'], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "c75753bedac50caf2c38616217832e13eda01885"
   },
   "source": [
    "### Part 8. Cross-validation and adjustment of model hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3c8d7b2100b0ee572bde036a9494934cb9b7e696"
   },
   "outputs": [],
   "source": [
    "cat_features = [0, 1, 2]\n",
    "\n",
    "X_train_cbr, X_test_cbr, y_train_cbr, y_test_cbr = train_test_split(X_train, y_train, test_size=0.3, random_state=17, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "9999cfeffa8006f4e297bb8713f94a1729f50c3b"
   },
   "outputs": [],
   "source": [
    "from hyperopt import hp, fmin, tpe, STATUS_OK, Trials\n",
    "import colorama\n",
    "\n",
    "# the number of random sets of hyperparameters\n",
    "N_HYPEROPT_PROBES = 100\n",
    "\n",
    "# hyperparameter sampling algorithm\n",
    "HYPEROPT_ALGO = tpe.suggest\n",
    "\n",
    "space ={\n",
    "        'depth': hp.quniform(\"depth\", 4, 10, 1),\n",
    "        'learning_rate': hp.loguniform('learning_rate', -3.0, -0.7),\n",
    "        'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 10),\n",
    "       }\n",
    "\n",
    "def get_catboost_params(space):\n",
    "    params = dict()\n",
    "    params['learning_rate'] = space['learning_rate']\n",
    "    params['depth'] = int(space['depth'])\n",
    "    params['l2_leaf_reg'] = space['l2_leaf_reg']\n",
    "    return params\n",
    "\n",
    "def objective(space, target_column='count'):\n",
    "    global obj_call_count, cur_best_rmse\n",
    "\n",
    "    obj_call_count += 1\n",
    "\n",
    "    print('\\nCatBoost objective call #{} cur_best_acc={:7.5f}'.format(obj_call_count, cur_best_rmse) )\n",
    "\n",
    "    params = get_catboost_params(space)\n",
    "\n",
    "    sorted_params = sorted(space.items(), key=lambda z: z[0])\n",
    "    params_str = str.join(' ', ['{}={}'.format(k, v) for k, v in sorted_params])\n",
    "    print('Params: {}'.format(params_str) )\n",
    "\n",
    "    model = CatBoostRegressor(iterations=2000,\n",
    "                              cat_features = cat_features,\n",
    "                            learning_rate=params['learning_rate'],\n",
    "                            depth=int(params['depth']),\n",
    "                            use_best_model=True,\n",
    "                            eval_metric='RMSE',\n",
    "                            l2_leaf_reg=params['l2_leaf_reg'],\n",
    "                            early_stopping_rounds=50,\n",
    "                            random_seed=17,\n",
    "                            verbose=False\n",
    "                            )\n",
    "    model.fit(X_train_cbr, y_train_cbr[target_column], \n",
    "              eval_set=(X_test_cbr, y_test_cbr[target_column]), \n",
    "              verbose=False)\n",
    "    nb_trees = model.get_best_iteration()\n",
    "\n",
    "    print('nb_trees={}'.format(nb_trees))\n",
    "\n",
    "    y_pred = model.predict(X_test_cbr)\n",
    "\n",
    "    rmse = np.sqrt(mean_squared_error(y_test_cbr[target_column], y_pred))\n",
    "\n",
    "    print('rmse={}, Params:{}, nb_trees={}\\n'.format(rmse, params_str, nb_trees ))\n",
    "\n",
    "    if rmse<cur_best_rmse:\n",
    "        cur_best_rmse = rmse\n",
    "        print(colorama.Fore.GREEN + 'NEW BEST RMSE={}'.format(cur_best_rmse) + colorama.Fore.RESET)\n",
    "\n",
    "\n",
    "    return{'loss':rmse, 'status': STATUS_OK }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "1da1c828ec884d553f701badaaf3cc558563650a"
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "obj_call_count = 0\n",
    "cur_best_rmse = np.inf\n",
    "\n",
    "trials = Trials()\n",
    "best = fmin(fn=objective,\n",
    "                     space=space,\n",
    "                     algo=HYPEROPT_ALGO,\n",
    "                     max_evals=N_HYPEROPT_PROBES,\n",
    "                     trials=trials,\n",
    "                     verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f59a901f9dcc8b574413638945b7a429873f8746"
   },
   "outputs": [],
   "source": [
    "print('The best params:')\n",
    "print( best )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "7408ab254f5f8266475ba39f915a81ccdea5f665"
   },
   "outputs": [],
   "source": [
    "cbr = CatBoostRegressor(random_seed=17, \n",
    "                        eval_metric='RMSE', \n",
    "                        iterations=2000, \n",
    "                        max_depth=best['depth'], \n",
    "                        early_stopping_rounds=50, \n",
    "                        learning_rate=best['learning_rate'],\n",
    "                        l2_leaf_reg=best['l2_leaf_reg'],\n",
    "                       use_best_model=True)\n",
    "cbr.fit(X_train_cbr, y_train_cbr['count'], \n",
    "       eval_set=(X_test_cbr, y_test_cbr['count']), \n",
    "       cat_features=cat_features,\n",
    "       silent=True,\n",
    "       plot=True);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "9f36c42f30b62fa1d3a8c1b505d314caf7538702"
   },
   "outputs": [],
   "source": [
    "rmse_learn = cbr.evals_result_['learn']['RMSE']\n",
    "rmse_test = cbr.evals_result_['validation_0']['RMSE']\n",
    "\n",
    "plt.plot(rmse_learn)\n",
    "plt.plot(rmse_test)\n",
    "plt.title('RMSE on train/test data')\n",
    "plt.xlabel('trees count')\n",
    "plt.ylabel('rmse value')\n",
    "plt.legend(['leanr', 'test']);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "d73a41604cc10abb525a8be7934316051b382445"
   },
   "outputs": [],
   "source": [
    "#Get important features\n",
    "cbr.feature_importances_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "4d49e1c6ad87d973331cda83cefadcfbb0a03f89"
   },
   "source": [
    "Most important features are 'humidity_tr', 'atemp_tr' and 'temp_tr'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "8577eefea855369fb39f6c480a85d5401d65bb8f"
   },
   "source": [
    "###  Part 9. Creation of new features and description of this process\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "59777d6662507709d8a3ed5aa2c167f790f4224a"
   },
   "source": [
    "Add new features from 'datetime'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f959bf48ef2b229c27dcb462ca0510587e3cfa3b"
   },
   "outputs": [],
   "source": [
    "train_df['hour'] = train_df['datetime'].apply(lambda ts: ts.hour)\n",
    "test_df['hour'] = test_df['datetime'].apply(lambda ts: ts.hour) \n",
    "\n",
    "train_df['weekday'] = train_df['datetime'].apply(lambda ts: ts.isoweekday())\n",
    "test_df['weekday'] = test_df['datetime'].apply(lambda ts: ts.isoweekday())\n",
    "\n",
    "train_df['month'] = train_df['datetime'].apply(lambda ts: ts.month)\n",
    "test_df['month'] = test_df['datetime'].apply(lambda ts: ts.month) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "39f38f8847b81010fbfae6910eb7cb3158063489"
   },
   "outputs": [],
   "source": [
    "draw_train_test_distribution('hour')\n",
    "draw_train_test_distribution('weekday');\n",
    "draw_train_test_distribution('month');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "94e68504953b59a4825ec2317ac5eb6882d86697"
   },
   "source": [
    "Disributions in train and test datasets are equale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "177227e2763ffe4b59fe83c4e26f5c2359ee3409"
   },
   "outputs": [],
   "source": [
    "hours_mean_count = {}\n",
    "hours_mean_casual = {}\n",
    "hours_mean_registered = {}\n",
    "\n",
    "hours_mean ={}\n",
    "hours = np.unique(train_df['hour'])\n",
    "print(\"hours:\",hours)\n",
    "\n",
    "temp_train_df =  train_df.join(y_train)\n",
    "\n",
    "for h in hours:\n",
    "    hours_mean_count[h] = temp_train_df.loc[temp_train_df['hour'] == h]['count'].mean()\n",
    "    hours_mean_casual[h] = temp_train_df.loc[temp_train_df['hour'] == h]['casual'].mean()\n",
    "    hours_mean_registered[h] = temp_train_df.loc[temp_train_df['hour'] == h]['registered'].mean()\n",
    "    \n",
    "    hours_mean[h] = [temp_train_df.loc[temp_train_df['hour'] == h]['count'].mean(),\n",
    "                    temp_train_df.loc[temp_train_df['hour'] == h]['casual'].mean(),\n",
    "                    temp_train_df.loc[temp_train_df['hour'] == h]['registered'].mean()]\n",
    "    \n",
    "hours_df = pd.DataFrame.from_dict(hours_mean, orient='index',\n",
    "                        columns=['count', 'casual', 'registered'])  \n",
    "hours_df['hours'] = hours\n",
    "\n",
    "hours_df.plot(x='hours', y=['count', 'casual', 'registered'], kind='bar', \n",
    "              title = 'Measured bike use over 2 years',  legend = True );\n",
    "\n",
    "del temp_train_df\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "55b4df74d224f422e96001eb83425b3f755a639b"
   },
   "source": [
    "Ok. It makes sense to make features by the hour: night (23-0-6),  day (7-22)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bb4d94fa65ea05497795b80213c7e59f55db6f5c"
   },
   "outputs": [],
   "source": [
    "train_df['night'] = train_df['hour'].apply(lambda h: 1 if (h>=23) | (h<=6) else 0)\n",
    "train_df['day'] = train_df['hour'].apply(lambda h: 1 if (h<23) & (h>6) else 0)\n",
    "\n",
    "test_df['night'] = test_df['hour'].apply(lambda h: 1 if (h>=23) | (h<=6) else 0)\n",
    "test_df['day'] = test_df['hour'].apply(lambda h: 1 if (h<23) & (h>6) else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "8605f7711c7083a0ff945c9890543731baff1a7d"
   },
   "source": [
    "I like trigonimetry)) [Here](https://habr.com/company/ods/blog/325422/#data-i-vremya) and [here](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-6-feature-engineering-and-feature-selection-8b94f870706a) is a great example of how to use trigonometry. I'll modify it a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "97e4f3552482723bf721da9331b2e18510f0180f"
   },
   "outputs": [],
   "source": [
    "def make_harmonic_features(value, period=24):\n",
    "    new_value = value * 2 * np.pi / period\n",
    "    return np.cos(new_value), np.sin(new_value)\n",
    "\n",
    "train_df['hour_cos'], train_df['hour_sin'] = make_harmonic_features(train_df['hour'])\n",
    "test_df['hour_cos'], test_df['hour_sin'] = make_harmonic_features(test_df['hour'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "188ce62e9fd51852867180700d60ff3a32d20266"
   },
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5e163408f078fa54ae4db00a2c4e1b865188b5e1"
   },
   "source": [
    "### Part 10. Plotting training and validation curves"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "66d199bd7f303d820a2cf2d97e486f8a33f9549c"
   },
   "source": [
    "Build new model with new params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8660f5045ae488929feefb10ae70fcf90b9d128b"
   },
   "outputs": [],
   "source": [
    "X_train = train_df.drop(['holiday', 'datetime', 'temp', 'atemp', 'humidity'], axis=1)\n",
    "X_test = test_df.drop(['holiday', 'datetime', 'temp', 'atemp', 'humidity'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "ad770cd15e521d59a0f4d5cd9f5099065fd2bf32"
   },
   "outputs": [],
   "source": [
    "X_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "d2bb1c85325a9ef51b91742efdb56a6334f4a8d3"
   },
   "outputs": [],
   "source": [
    "cat_features = [0, 1, 2, 7, 8, 9, 10, 11]\n",
    "\n",
    "X_train_cbr, X_test_cbr, y_train_cbr, y_test_cbr = train_test_split(X_train, y_train, test_size=0.3, random_state=17, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "5c938615cc3474ebb5363f638b92cd1d0d7a7c66"
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "obj_call_count = 0\n",
    "cur_best_rmse = np.inf\n",
    "\n",
    "trials = Trials()\n",
    "best = fmin(fn=objective,\n",
    "                     space=space,\n",
    "                     algo=HYPEROPT_ALGO,\n",
    "                     max_evals=N_HYPEROPT_PROBES,\n",
    "                     trials=trials,\n",
    "                     verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bef13a23bb5db25a1011b53194eb15f230db83fa"
   },
   "outputs": [],
   "source": [
    "print('The best params:')\n",
    "print(best)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0b850402bf492e3b181883761886ef5f111a6646"
   },
   "outputs": [],
   "source": [
    "cbr = CatBoostRegressor(random_seed=17, \n",
    "                        eval_metric='RMSE', \n",
    "                        iterations=2000, \n",
    "                        max_depth=best['depth'], \n",
    "                        early_stopping_rounds=50, \n",
    "                        learning_rate=best['learning_rate'],\n",
    "                        l2_leaf_reg=best['l2_leaf_reg'])\n",
    "cbr.fit(X_train_cbr, y_train_cbr['count'], \n",
    "       eval_set=(X_test_cbr, y_test_cbr['count']), \n",
    "       cat_features=cat_features,\n",
    "       silent=True,\n",
    "       plot=True);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "19cdad8c295a01952ce854fcf64337dbaaafa8e2"
   },
   "outputs": [],
   "source": [
    "rmse_learn = cbr.evals_result_['learn']['RMSE']\n",
    "rmse_test = cbr.evals_result_['validation_0']['RMSE']\n",
    "\n",
    "plt.plot(rmse_learn)\n",
    "plt.plot(rmse_test)\n",
    "plt.title('RMSE on train/test data')\n",
    "plt.xlabel('trees count')\n",
    "plt.ylabel('rmse value')\n",
    "plt.legend(['leanr', 'test']);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5cf2d54dd11e1df9343328f107abd83d28076908"
   },
   "source": [
    "### Part 11. Prediction for test or hold-out samples "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b6ced0690558c0b9d091f64cedfd618ddd0262e0"
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "cbr = CatBoostRegressor(random_seed=17, \n",
    "                        eval_metric='RMSE', \n",
    "                        iterations=2000, \n",
    "                        max_depth=best['depth'], \n",
    "#                         early_stopping_rounds=50, #because fit without eval_set\n",
    "                        learning_rate=best['learning_rate'],\n",
    "                        l2_leaf_reg=best['l2_leaf_reg'])\n",
    "cbr.fit(X_train, y_train['count'], \n",
    "       cat_features=cat_features,\n",
    "       silent=True,\n",
    "       plot=False);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "fbf4901ed0bbd9bea75a906620a02a323ceb2032"
   },
   "outputs": [],
   "source": [
    "y_pred = cbr.predict(X_test)\n",
    "rmse = np.sqrt(mean_squared_error(y_test['count'], y_pred))\n",
    "print('RMSE = ', rmse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "0f2c3649a9bc0e54d5f79bf99988b1bbae1be8f1"
   },
   "source": [
    "**RMSE for hold_out samples coincides with the RMSE on cross-validation!!!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "de459a28d7a1068611012d373a31993558c72327"
   },
   "source": [
    "### Part 12. Conclusions \n",
    "* New features significantly improved the result. 'cbr.feature_importances_' shows us the most important parameters. If you select new features from them, you can get a better model.\n",
    "* Some features can degrade the result and it would be necessary to find them and remove them from the model.\n",
    "* Here you can also apply an ensemble of models based on different targets ('casual', 'registered' and 'count'). Also try other libraries (lightgbm, XGBoost) and compare results. And can also apply an ensemble of models based on different libraries )))\n",
    "* The implemented functions **objective**  and **get_catboost_params** for setting hyperparameters can be used in other projects. I think this is a good start for creating a more flexible library that can be used in Kaggle competitions\n",
    "* I used CPU, but GPU can be used in catboost. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
